Session 6b: Tidying data
January 20, 2023
Note how you already know the basics of a number of these steps!
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1 - 23. doi:http://dx.doi.org/10.18637/jss.v059.i10
tidyr package, which is part of the tidyverse.“all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000”
(from ?tidyr::table1)
If you want to inspect the data yourself, you can access them via table1, table2 and table3.
[1] "table1"
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
[1] "table3"
# A tibble: 6 × 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
[1] "table2"
# A tibble: 6 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
table1, you were right!table4a and table4b from tidyr)[1] "table4a"
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
[1] "table4b"
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
Having one consistent format for data makes it easier to learn the tools required for analysis (which can have a certain uniformity). The tidyverse packages, for example, are designed to work with tidy data (who would have thought! 🤣)
It is advantageous for variables to be placed in columns because this caters to R’s vectorised nature. (Most R-functions work with vectors of values.)
pivot_longer() and pivot_wider() in tidyr! Note: this doesn’t mean that non-tidy data are “bad”. There can be many reasons for why a dataset is in a non-tidy format, e.g. ease of data entry if this is being done manually.
Common problem: column names are values of a variable rather than variables
Example: table4a
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
Solution: pivot these columns to new variables, rendering the dataset longer
We need:
1999 and 2000)year)cases)from R4DS
Exercise: try doing the same thing with table4b!
We can easily join the longer versions of table4a and table4b using left_join() (more on joining operations later):
# A tibble: 6 × 4
country year cases population
<chr> <chr> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
clean_names(), which cleans up problematic variable names (e.g. names with spaces, starting with a digit etc.)pivot_wider() is the counterpart of pivot_longer() which you need when observations are spread across multiple rows such as in table2# A tibble: 6 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
To tidy table2 we need:
type)count)# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
from R4DS
# A tibble: 6 × 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table3, the rate column contains both cases and populationseparate() functionseparate is clever enough to correctly guess the delimiting character – it looks for a non-alphanumeric character by default (to specify it manually, use sep = "/")# A tibble: 6 × 4
country year cases population
<chr> <int> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
separate retains the original column type (character in this case)convert parameter# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583